Published July 23, 2024
By Kimberly Mann Bruch, SDSC Communications
The U.S. National Science Foundation (NSF) has awarded a $5.3 million supplement to the San Diego Supercomputer Center (SDSC) at UC San Diego to expand the artificial intelligence (AI) capabilities of Expanse, SDSC’s flagship supercomputer.
The award was made as part of the National AI Research Resource (NAIRR) pilot program, which aims to connect U.S. researchers and educators to computational, data, and training resources needed to advance AI research and research that employs AI.
The expansion adds 34 Dell XE9640 servers to Expanse, each containing four NVIDIA H100 GPUs, two 36-core Intel Sapphire Rapids processors, 1 TB of RAM and 6.4 TB of local NVMe storage. The funding also provides for the purchase of 3 PB of cloud-compatible Ceph storage and support to cover two years of operation.
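For readers who want a sense of the aggregate scale, the per-server figures above can be multiplied out. The short Python sketch below is illustrative only: the totals are derived from the numbers in this article rather than quoted by SDSC, the constant and function names are hypothetical, and the sketch assumes all 34 servers are configured identically.

```python
# Per-server figures as stated in the article; the totals are derived, not quoted.
SERVERS = 34
GPUS_PER_SERVER = 4
CPUS_PER_SERVER = 2
CORES_PER_CPU = 36
RAM_TB_PER_SERVER = 1
NVME_TB_PER_SERVER = 6.4


def expansion_totals():
    """Return aggregate resources, assuming every server is configured identically."""
    return {
        "gpus": SERVERS * GPUS_PER_SERVER,
        "cpu_cores": SERVERS * CPUS_PER_SERVER * CORES_PER_CPU,
        "ram_tb": SERVERS * RAM_TB_PER_SERVER,
        # Round to one decimal place to avoid floating-point noise.
        "nvme_tb": round(SERVERS * NVME_TB_PER_SERVER, 1),
    }


totals = expansion_totals()
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under those assumptions, the expansion works out to 136 GPUs, 2,448 CPU cores, 34 TB of RAM and about 217.6 TB of local NVMe storage, in addition to the 3 PB of Ceph storage.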
“The NAIRR expansion will enable researchers to train complex models that can be applied to a wide range of real-life applications such as weather prediction, image classification, natural language processing, health care, software development, smart manufacturing and self-driving cars,” said Principal Investigator Michael Norman. “State-of-the-art models can now have many billions of tunable parameters and would not be possible without access to the most advanced computer hardware.”
Expanse has a modular design based on SDSC Scalable Compute Units (SSCUs), each consisting of one rack containing CPU and/or GPU nodes and networking to provide full fat-tree connectivity within the rack, along with connections to the other racks and distributed file systems. This design makes it straightforward to grow the system: the NAIRR supplement will be the fourth expansion, following the addition of SSCUs for SDSC’s industry partners program, the Partnership for High Throughput Computing (PATh) and the Center for Western Weather and Water Extremes (CW3E).
“By deploying the new hardware as an addition to Expanse rather than a standalone system, the expansion leverages existing infrastructure for power management and cooling,” Norman said. “Operations are also simplified since the same cluster management system and batch scheduler can be used.”